Education. Work. Money. These three words are almost guaranteed to be at the forefront of a student’s mind as they contemplate their future upon leaving high school. With more and more opportunities for anyone to pursue almost any degree in any field, the inevitable question of “Which major should I take?” is becoming an increasingly hard one for students across the world.
In this research project, data from over 6.7 million college graduates in the USA has been analysed to examine key questions regarding:
The data shows that Engineering majors often see the highest levels of income, followed by other major categories such as Business and Law. In particular, Petroleum Engineering stands above the rest of the majors with a median income of USD $110,000, compared with a median of USD $36,000 across all majors.
Further, UNEMPLOYMENT.
Additionally, POPULARITY.
In 2012, the median personal income in the US was $28,213 and the unemployment rate was 8.1%.
Recently, a research article has shown that “since the [GFC]… students have turned away from the humanities and towards job-oriented degrees” (Kopf, 2018), with the share of degrees in history dropping from 2% in 2007 to 1% in 2017 (Kopf, 2018). This seems to reflect “a new set of student priorities… formed even before they see the inside of a college classroom… Students [are] fleeing humanities and related fields specifically because they think they have poor job prospects.” (Schmidt, 2018).
The data was collected from the American Community Survey 2010 - 2012 Public Use Microdata Sample Files (PUMS) at the USA Census Website. It was initially wrangled by media company FiveThirtyEight (a part of ABC News Internet Ventures), with code accessible here.
The US Bureau of the Census is a government body, and although FiveThirtyEight had commercial interests, their process of data wrangling was highly transparent and reproducible. Therefore these sources can be considered reliable.
The Census Bureau produces the PUMS as an inexpensive and accessible data source for students and social scientists, while FiveThirtyEight wrangled this data for commercial use in their article The Economic Guide to Picking a College Major, aimed at educating students on how to choose their college majors.
This data is particularly relevant to high school leavers trying to choose a major, as well as current students contemplating their career prospects, as it may help them make a more informed economic decision.
University staff, internship firms, or other organisations may also benefit from being able to anticipate the future direction of the workforce, allowing for better resource allocation, such as investment into engineering and STEM fields.
str(gradData)
## 'data.frame': 173 obs. of 21 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Major_code : int 2419 2416 2415 2417 2405 2418 6202 5001 2414 2408 ...
## $ Major : Factor w/ 173 levels "ACCOUNTING","ACTUARIAL SCIENCE",..: 141 116 113 132 24 134 2 15 109 53 ...
## $ Total : int 2339 756 856 1258 32260 2573 3777 1792 91227 81527 ...
## $ Men : int 2057 679 725 1123 21239 2200 2110 832 80320 65511 ...
## $ Women : int 282 77 131 135 11021 373 1667 960 10907 16016 ...
## $ Major_category : Factor w/ 16 levels "Agriculture & Natural Resources",..: 8 8 8 8 8 8 4 14 8 8 ...
## $ ShareWomen : num 0.121 0.102 0.153 0.107 0.342 ...
## $ Sample_size : int 36 7 3 16 289 17 51 10 1029 631 ...
## $ Employed : int 1976 640 648 758 25694 1857 2912 1526 76442 61928 ...
## $ Full_time : int 1849 556 558 1069 23170 2038 2924 1085 71298 55450 ...
## $ Part_time : int 270 170 133 150 5180 264 296 553 13101 12695 ...
## $ Full_time_year_round: int 1207 388 340 692 16697 1449 2482 827 54639 41413 ...
## $ Unemployed : int 37 85 16 40 1672 400 308 33 4650 3895 ...
## $ Unemployment_rate : num 0.0184 0.1172 0.0241 0.0501 0.0611 ...
## $ Median : int 110000 75000 73000 70000 65000 65000 62000 62000 60000 60000 ...
## $ P25th : int 95000 55000 50000 43000 50000 50000 53000 31500 48000 45000 ...
## $ P75th : int 125000 90000 105000 80000 75000 102000 72000 109000 70000 72000 ...
## $ College_jobs : int 1534 350 456 529 18314 1142 1768 972 52844 45829 ...
## $ Non_college_jobs : int 364 257 176 102 4440 657 314 500 16384 10874 ...
## $ Low_wage_jobs : int 193 50 0 0 972 244 259 220 3253 3170 ...
This data consists of 20 variables (excluding “Rank”, which orders the majors by median income); however, only xx variables are relevant for the study:
Major_code: A unique code for each major, given by the source.
Type: Integer
Assessment: Although it is a number, a factor classification would be more suitable, as the codes are nominal (no order).
Major: The major’s name.
Type: Factor
Assessment: Either a character or factor classification would be suitable.
Total, Men, Women: Number of people in total, men, and women respectively with that major in the sample for 2010-2012.
Type: Integer
Assessment: Suitable.
Major_category: General category for that major (e.g. “Engineering”).
Type: Factor
Assessment: Suitable - allows for easy classification and plotting.
Sample_size: Sample size for calculating income quartiles.
Type: Integer
Assessment: Suitable.
Employed, Full_time, Part_time: Number of people employed, employed 35 hours or more per week, and employed fewer than 35 hours per week respectively.
Type: Integer
Assessment: Suitable.
Full_time_year_round: Number of people employed for at least 50 weeks per year and at least 35 hours per week.
Type: Integer
Assessment: Suitable.
Unemployed: Number of people considered unemployed by census data.
Type: Integer
Assessment: Suitable.
Unemployment_rate: The proportion of people unemployed, calculated as Unemployed / (Unemployed + Employed).
Type: Numeric
Assessment: Suitable.
Median, P25th, P75th: Median, 25th percentile, and 75th percentile earnings respectively for full-time, year-round workers (in USD).
Type: Integer
Assessment: Suitable - although income is continuous, it can be considered discrete without significantly impacting the data.
College_jobs, Non_college_jobs, Low_wage_jobs: Number of people with a job requiring a college degree, not requiring a college degree, and in a low-wage service job respectively.
Type: Integer
Assessment: Suitable.
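Two of the assessments above lend themselves to a quick check: storing the major code as a nominal factor, and the definition of the unemployment rate. The following is a minimal sketch rather than part of the original analysis; the data frame `demo` is a stand-in built from the first three rows of the str() output above, and `recomputed_rate` is an illustrative name.

```r
# Stand-in for gradData: first three rows copied from the str() output above
demo <- data.frame(
  Major_code        = c(2419L, 2416L, 2415L),
  Employed          = c(1976L, 640L, 648L),
  Unemployed        = c(37L, 85L, 16L),
  Unemployment_rate = c(0.0184, 0.1172, 0.0241)
)

# Major codes are labels, not quantities, so store them as a factor
demo$Major_code <- factor(demo$Major_code)

# Recompute the rate from its definition: Unemployed / (Unemployed + Employed)
recomputed_rate <- demo$Unemployed / (demo$Unemployed + demo$Employed)

# Compare with the published column (str() shows it to about 4 decimal places)
print(round(recomputed_rate, 4))
print(demo$Unemployment_rate)
```

On these three rows the recomputed values agree with the published column to the precision shown, which supports reading Unemployment_rate as a proportion rather than a percentage.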
Possible Issues:
Validity:
This data, taking into account the issues above and their solutions, can be considered valid. However, care must be taken to acknowledge confounders, such as personality and circumstance, rather than just major choice, in influencing the variables.
Which college major should a student take to receive the highest income?
There are three variables to consider - the 25th percentile, median, and 75th percentile incomes. Additionally, it is important to consider both individual majors and major categories. Taking a summary initially shows that there is a significant range of incomes:
summary(gradData$Median)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22000 33000 36000 40151 45000 110000
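Since both individual majors and major categories matter here, the same summary can also be taken per category. The sketch below assumes a stand-in data frame `top10` holding only the ten highest-ranked rows from the str() output above, with category labels inferred from the factor codes shown there; on the full data the same call would be applied to gradData.

```r
# Stand-in: the ten highest-ranked majors from the str() output above.
# Category labels are inferred from the factor codes (8, 4, 14) shown there.
top10 <- data.frame(
  Major_category = c(rep("Engineering", 6), "Business",
                     "Physical Sciences", rep("Engineering", 2)),
  Median = c(110000, 75000, 73000, 70000, 65000,
             65000, 62000, 62000, 60000, 60000)
)

# Median of the per-major median incomes within each category
by_category <- aggregate(Median ~ Major_category, data = top10, FUN = median)

# Sort categories from highest to lowest
by_category <- by_category[order(-by_category$Median), ]
print(by_category)
```

Because this stand-in contains only the top ten majors, the figures it prints describe those rows alone and not the categories as a whole.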
hline <- function(y = 0) {
list(
type = "line",
x0 = 0,
x1 = 1,
xref = "paper",
y0 = y,
y1 = y,
line = list(color = "red", width=1)
)
}
plot_ly(gradData, y=~Median/1000, color=~Major_category, type="box") %>%
layout(
yaxis = list(title = "Median income (USD$1000)"),
xaxis = list(showticklabels = FALSE),
title = "Median Income per Major Category",
shapes = list(hline(36)))
Plotting the median income against major category backs up the summary - showing a large spread, centred around the median of $36,000.
# Selects top 10
gradData.head = head(gradData, n=10)
# Creates a new data frame to be able to plot median, 25th, and 75th percentiles on the same graph
median.df = data.frame(Major = gradData.head$Major, Major_category=gradData.head$Major_category,
Income = gradData.head$Median)
p25.df = data.frame(Major = gradData.head$Major, Major_category=gradData.head$Major_category,
Income = gradData.head$P25th)
p75.df = data.frame(Major = gradData.head$Major, Major_category=gradData.head$Major_category,
Income = gradData.head$P75th)
gradData.head.df = rbind(median.df, p25.df, p75.df)
# Selects bottom 10
gradData.tail = tail(gradData, n=10)
# Creates a new data frame to be able to plot median, 25th, and 75th percentiles on the same graph
median.df = data.frame(Major = gradData.tail$Major, Major_category=gradData.tail$Major_category,
Income = gradData.tail$Median)
p25.df = data.frame(Major = gradData.tail$Major, Major_category=gradData.tail$Major_category,
Income = gradData.tail$P25th)
p75.df = data.frame(Major = gradData.tail$Major, Major_category=gradData.tail$Major_category,
Income = gradData.tail$P75th)
gradData.tail.df = rbind(median.df, p25.df, p75.df)
score = rbind(gradData.tail, gradData.tail, gradData.tail)
# Combines the two
gradData.combined.df =rbind(gradData.head.df, gradData.tail.df)
# Orders it
score = rbind(gradData.head, gradData.head, gradData.head, score)
# Plots a boxplot
plot_ly(gradData.combined.df, y=~Income/1000, x=~reorder(Major, -score$Median), color=~Major_category, type="box") %>%
layout(
yaxis = list(
title = "Median income (USD$1000)",
autotick = FALSE,
ticks = "outside",
tick0 = 0,
dtick = 10,
ticklen = 3,
tickwidth = 1),
xaxis = list(
showticklabels = TRUE, title="",
tickangle = 270, tickfont = list(size = 10)),
title = "Top 10 and Bottom 10 Majors by Median Income")
Looking at individual majors, there are initially too many data points to make sense of the information. Instead, after ordering the data by median income, the subjects can be limited to only the top and bottom 10 majors (note that this plotted data takes into account the median, 25th percentile, and 75th percentile). While 8 of the top 10 majors belong to the Engineering category (Major_category code 8 in the str() output), the bottom 10 majors are considerably more varied.
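The plotting code repeats the same three data.frame() calls for each subset in order to place the median, 25th percentile, and 75th percentile on one axis. A minimal sketch of that reshaping step as a reusable function is given here; `stack_income` is a hypothetical helper, not something used in the original analysis, and the two rows of `demo` are taken from ranks 1 and 7 of the str() output (major names and categories inferred from the factor codes).

```r
# Hypothetical helper: stack the three income columns into one long data frame
stack_income <- function(df) {
  do.call(rbind, lapply(c("P25th", "Median", "P75th"), function(col) {
    data.frame(
      Major          = df$Major,
      Major_category = df$Major_category,
      Income         = df[[col]]
    )
  }))
}

# Stand-in rows (ranks 1 and 7 in the str() output above)
demo <- data.frame(
  Major          = c("PETROLEUM ENGINEERING", "ACTUARIAL SCIENCE"),
  Major_category = c("Engineering", "Business"),
  P25th          = c(95000, 53000),
  Median         = c(110000, 62000),
  P75th          = c(125000, 72000)
)

# Each major now contributes three rows, one per income measure
long <- stack_income(demo)
print(long)
```

With such a helper, the top-10, bottom-10, and full-data frames could each be produced by a single call, for example stack_income(head(gradData, n = 10)).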
# Selects all of gradData
combined = gradData
# Combines median, 25th percentile, and 75th percentile into one data frame for plotting
median.df = data.frame(Major = combined$Major, Major_category=strtrim(combined$Major_category, 45),
Income = combined$Median)
p25.df = data.frame(Major = combined$Major, Major_category=strtrim(combined$Major_category, 45),
Income = combined$P25th)
p75.df = data.frame(Major = combined$Major, Major_category=strtrim(combined$Major_category, 45),
Income = combined$P75th)
combined.df = rbind(median.df, p25.df, p75.df)
# Plots a density chart
ggplotly(ggplot(combined.df, aes(x=Income/1000, fill=Major_category)) + geom_density(alpha=0.2) +
facet_wrap(~Major_category) +
xlab("Income (USD$1000)") + ylab("Density") +
labs(title="Income Distribution per Major Category") + theme_minimal() +
theme(legend.position="none", strip.text.x = element_text(size = 7),
axis.text.y = element_blank()))
# Selects only Engineering, education and business majors
combined = rbind(gradData[gradData$Major_category=="Engineering",],
gradData[gradData$Major_category=="Education",],
gradData[gradData$Major_category=="Business",])
# Combines median, 25th percentile, and 75th percentile into one data frame for plotting
median.df = data.frame(Major = combined$Major, Major_category=combined$Major_category,
Income = combined$Median)
p25.df = data.frame(Major = combined$Major, Major_category=combined$Major_category,
Income = combined$P25th)
p75.df = data.frame(Major = combined$Major, Major_category=combined$Major_category,
Income = combined$P75th)
combined.df = rbind(median.df, p25.df, p75.df)
# Plots a density chart
ggplotly(ggplot(combined.df, aes(x=Income/1000, fill=Major_category)) + geom_density(alpha=0.2) +
xlab("Income (USD$1000)") + ylab("Density") +
labs(fill="Major Category", title="Income Distribution per Major Category (Selection)") +
theme_minimal())
Examining the density estimation of a selection of major categories, again Engineering appears to have significantly higher incomes compared to other categories. However, the estimation shows that the spread is also significantly larger, with a portion of the income falling within the range of the lowest majors. This is contrasted with Education, where the range is confined to ~$25,000.
# Coefficient of Variation for the Engineering sample's incomes
message("Coefficient of Variation for Engineering: ", sd(combined.df[combined.df$Major_category=="Engineering",]$Income)/
mean(combined.df[combined.df$Major_category=="Engineering",]$Income))
## Coefficient of Variation for Engineering: 0.329608913424962
# Coefficient of Variation for the Education sample's incomes
message("Coefficient of Variation for Education: ", sd(combined.df[combined.df$Major_category=="Education",]$Income)/
mean(combined.df[combined.df$Major_category=="Education",]$Income))
## Coefficient of Variation for Education: 0.20893468637973
This is re-iterated by the coefficient of variation for Engineering being over 150% of Education’s.
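The comparison can be made explicit by dividing the two printed coefficients. The sketch below is illustrative only: `cv` is a hypothetical convenience function implementing the same sd/mean calculation used above, and the two constants are copied from the printed output.

```r
# Coefficient of variation: standard deviation relative to the mean
cv <- function(x) sd(x) / mean(x)

# Small sanity check on toy values: sd = 2, mean = 4
toy_cv <- cv(c(2, 4, 6))

# Values copied from the output printed above
cv_engineering <- 0.329608913424962
cv_education   <- 0.20893468637973

# Ratio behind the "over 150%" statement
cv_ratio <- cv_engineering / cv_education
print(round(cv_ratio, 2))
```

The ratio comes out at roughly 1.58, so the relative spread of Engineering incomes is a little under 160% of Education's.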
Summary:
The data shows that Engineering incomes can far exceed those in other categories, with Petroleum Engineering in particular being significantly higher than the other majors. Indeed, the separation of Petroleum Engineering from the other top 10 median incomes is comparable to the separation of the top 10 from the bottom 10. However, Engineering incomes overall have a significantly larger spread than the other categories, implying a volatility either between majors or within the industries themselves. Nevertheless, students seeking high incomes may be best suited to look towards Engineering fields.
Which college major should a student pursue to see the greatest prospects for employment? Which college majors prove most useful in achieving employment?
Look at variables related to employment (both employment chances, and the need for a college degree in relation to employment)
Compare unemployment rates between majors and major categories
Compare worth of college degree in achieving employment
Because this data covers only graduates under 28, it essentially reflects entry-level income and does not capture how income increases over time.
Insert text and analysis.
percentCollege = gradData$College_jobs/(gradData$College_jobs+gradData$Non_college_jobs+gradData$Low_wage_jobs)
ggplot(gradData, aes(x=factor(Major_category), y=percentCollege)) +
geom_boxplot(aes(fill = factor(Major_category))) +
theme(axis.text.x = element_blank(), axis.title.x = element_blank(), axis.ticks.x = element_blank()) +
labs(fill = "Major Category", title="Percent College Jobs per Major Category") +
ylab("Percentage of Jobs as College Jobs")
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
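A warning of this kind is what one would expect if a major has zero recorded jobs in all three categories, since 0 / 0 evaluates to NaN in R and ggplot2 drops non-finite values. The following sketch illustrates the mechanism under that assumption; the second row of `jobs` is hypothetical, while the first row is taken from the str() output above.

```r
# First row from the str() output; second row is a hypothetical major
# with no recorded jobs in any category
jobs <- data.frame(
  College_jobs     = c(1534, 0),
  Non_college_jobs = c(364, 0),
  Low_wage_jobs    = c(193, 0)
)

# Denominator used for percentCollege
total_jobs <- rowSums(jobs)

# 0 / 0 yields NaN for the second row
share_college <- jobs$College_jobs / total_jobs

# Mark undefined shares as NA explicitly so the gap is deliberate
share_college[total_jobs == 0] <- NA
print(share_college)
```

Handling the zero denominator before plotting makes the omission visible in the data rather than leaving it to the plotting warning.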
How much of this employment is actually in their field / based on the degree (looking at college jobs vs non-college jobs)?
Looking at the results from Q1 and Q2, how do these “rankings” align with the popularity of these courses?
From Questions 1 and 2, the majors in the Engineering category have consistently ranked amongst the highest in terms of income. Examining the popularity of these courses should allow a stronger conclusion to be drawn about the reasons for any incongruity between these rankings and the choices students make.
gradData = read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv")
library(ggplot2)
library(plotly)
# Exclude Food Science and Military Technologies, as they have missing data.
gradTotalData = gradData[!(gradData$Major=="FOOD SCIENCE") & !(gradData$Major=="MILITARY TECHNOLOGIES") ,]
# Order the remaining majors by the total number of graduates (ascending)
gradTotalData = gradTotalData[order(gradTotalData$Total),]
bottom_ten_by_total = head(gradTotalData, 10)
top_ten_by_total = tail(gradTotalData, 10)
gradTotal_topbottom = rbind(top_ten_by_total, bottom_ten_by_total)
# geom_histogram() plots counts by default, so the y axis is labelled as a count
ggplotly(ggplot(gradTotalData, aes(x=Median, fill=Major_category)) + geom_histogram() + theme(legend.position = "none") +
xlab("Median Income") + ylab("Number of Majors") + labs(title="Popularity vs Median Income"))
Initially, by looking at the distribution of Median income across major categories, it’s apparent that more specialised fields (lower counts) tend to generate a higher income. By examining the categories (differentiated by colour, and named on hover), the data shows that engineering has a higher frequency in higher median income areas.
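The histogram shows how majors are distributed over median income, but it does not relate income to graduate numbers directly. One hedged way to check the link between popularity and income is a rank correlation between Total and Median. The sketch below is an assumption about how such a check could look, not part of the original analysis; `demo` holds only the ten highest-ranked rows from the str() output, so its result says nothing about the full dataset.

```r
# Stand-in: Total and Median for the ten highest-ranked majors (from str())
demo <- data.frame(
  Total  = c(2339, 756, 856, 1258, 32260, 2573, 3777, 1792, 91227, 81527),
  Median = c(110000, 75000, 73000, 70000, 65000,
             65000, 62000, 62000, 60000, 60000)
)

# Spearman's rank correlation: robust to the heavy skew in graduate counts
rho <- cor(demo$Total, demo$Median, method = "spearman")
print(rho)
```

On the full data the same call would use gradTotalData$Total and gradTotalData$Median; a clearly negative value would be consistent with smaller, more specialised majors earning more.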
# geom_histogram() plots counts by default, so the y axis is labelled as a count
ggplotly(ggplot(gradTotalData, aes(x=Employed/Total, fill=Major_category)) + geom_histogram() + theme(legend.position = "none") +
xlab("Employment Rate") + ylab("Number of Majors") + labs(title="Popularity vs Employment Rate"))
A histogram of employment rate shows a roughly normal distribution across majors. Interestingly, Engineering holds not only the major with the highest median income, but also the major with the highest employment rate of graduates. However, the employment rates of Engineering majors are observed to have a wide spread: as shown below, the lowest employment rate among Engineering majors is 2.75 standard deviations below the mean across all majors.
employmentMean = mean((gradTotalData$Employed/gradTotalData$Total))
employmentSd = sd((gradTotalData$Employed/gradTotalData$Total))
# Get all employment rates of engineering, then sort them in increasing order, and take the first element to get the lowest.
lowestEmployment = sort((gradTotalData$Employed/gradTotalData$Total)[gradTotalData$Major_category=="Engineering"], decreasing=FALSE)[1]
distance = ((employmentMean - lowestEmployment)/employmentSd)
message("Standard deviations below the mean for the lowest Engineering employment rate: ", distance)
## Standard deviations below the mean for the lowest Engineering employment rate: 2.75463404795189
Summary:
Bureau of Labor Statistics. (2019). Labor Force Statistics from the Current Population Survey, 2012 (LNU04000000) [Data set]. Retrieved from http://data.bls.gov.
Casselman, B. (2014, September 12). The Economic Guide to Picking a College Major. FiveThirtyEight. Retrieved from https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/.
FiveThirtyEight. (2014). College Majors 2010-2012 (Recent Grads) [Data set]. Retrieved from GitHub; https://github.com/fivethirtyeight/data/tree/master/college-majors.
Kopf, D. (2018, August 29). The 2008 financial crisis completely changed what majors students choose. Quartz. Retrieved from https://qz.com/1370922/the-2008-financial-crisis-completely-changed-what-majors-students-choose/.
Schmidt, B. (2018, August 3). The Humanities Are in Crisis. The Atlantic. Retrieved from https://www.theatlantic.com/ideas/archive/2018/08/the-humanities-face-a-crisisof-confidence/567565/.
U.S. Bureau of the Census. (2018). Public Use Microdata Samples (PUMS) Documentation. Retrieved from https://www.census.gov/programs-surveys/acs/technical-documentation/pums.html.
U.S. Bureau of the Census. (2017). Real Median Personal Income in the United States, 2012 (MEPAINUSA672N) [Data set]. Retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/MEPAINUSA672N.
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2.2 plotly_4.8.0 ggplot2_3.1.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.0 RColorBrewer_1.1-2 later_0.8.0
## [4] pillar_1.3.1 compiler_3.5.2 plyr_1.8.4
## [7] bindr_0.1.1 tools_3.5.2 digest_0.6.18
## [10] viridisLite_0.3.0 jsonlite_1.6 evaluate_0.12
## [13] tibble_2.0.1 gtable_0.2.0 pkgconfig_2.0.2
## [16] rlang_0.3.1 shiny_1.2.0 crosstalk_1.0.0
## [19] yaml_2.2.0 xfun_0.4 withr_2.1.2
## [22] dplyr_0.7.8 stringr_1.4.0 httr_1.4.0
## [25] knitr_1.21 htmlwidgets_1.3 grid_3.5.2
## [28] tidyselect_0.2.5 glue_1.3.0 data.table_1.12.0
## [31] R6_2.3.0 rmarkdown_1.11 tidyr_0.8.2
## [34] purrr_0.3.0 magrittr_1.5 promises_1.0.1
## [37] scales_1.0.0 htmltools_0.3.6 assertthat_0.2.0
## [40] xtable_1.8-3 mime_0.6 colorspace_1.4-0
## [43] httpuv_1.4.5.1 labeling_0.3 stringi_1.2.4
## [46] lazyeval_0.2.1 munsell_0.5.0 crayon_1.3.4